
COW-591 phase 2 extended prometheus metrics #16

Merged
lgahdl merged 13 commits into jefferson/cow-591-11-prometheus-exporters from jefferson/cow-614-cow-591-phase-2-extended-prometheus-metrics
Feb 26, 2026

Conversation

@jeffersonBastos
Collaborator

@jeffersonBastos jeffersonBastos commented Feb 6, 2026

Review Focus

Line count is high due to tests and metric declarations. Focus review on: (1) trader index-based cardinality management in exporter.py:200-217 to avoid high-cardinality address labels, (2) API error classification logic in _classify_api_error(), and (3) the (container_name, sample) tuple format change in MetricsStore callbacks.

Summary

Implements Phase 2 of the COW-591 Prometheus exporter, adding API performance, container resource, per-trader, and baseline comparison metrics. This completes the Prometheus metrics deliverable for the grant.

Changes

New Metrics Added

API Metrics

  • cow_perf_api_requests_total - Counter for API requests by endpoint/method/status
  • cow_perf_api_response_time_seconds - Histogram of API response times
  • cow_perf_api_errors_total - Counter for API errors by type (client_error, server_error, timeout, connection_error)
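
For reviewers, here is a minimal sketch of how a failed request could be mapped onto the four error types listed above. It is not the code in `_classify_api_error()`: the function name `classify_api_error`, its parameters, and the choice to check the exception before the status code are assumptions made for illustration.

```python
from typing import Optional


def classify_api_error(
    status_code: Optional[int] = None,
    exc: Optional[BaseException] = None,
) -> Optional[str]:
    """Return an error-type label value, or None if the request did not fail.

    Hypothetical helper; the real _classify_api_error() may take other inputs.
    """
    # Transport-level failures carry no HTTP status, so check the exception first.
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "connection_error"
    if status_code is not None:
        if 400 <= status_code < 500:
            return "client_error"
        if 500 <= status_code < 600:
            return "server_error"
    return None
```

Keeping the result to a small fixed set of strings means the error counter gains at most four series per endpoint.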

Container Resource Metrics

  • cow_perf_container_cpu_percent - Gauge for container CPU usage
  • cow_perf_container_memory_bytes - Gauge for container memory usage
  • cow_perf_container_network_rx_bytes - Gauge for network bytes received
  • cow_perf_container_network_tx_bytes - Gauge for network bytes transmitted

Per-Trader Metrics

  • cow_perf_trader_orders_submitted_total - Counter for orders submitted by trader index
  • cow_perf_trader_orders_filled_total - Counter for orders filled by trader index
  • cow_perf_traders_active - Gauge for count of currently active traders
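
The index-based labelling can be pictured with the sketch below. It is an illustration under assumptions, not the code at exporter.py:200-217: the class name `TraderIndexer`, the `max_traders` cap, and the `"overflow"` bucket are all hypothetical.

```python
from typing import Dict


class TraderIndexer:
    """Assign each trader address a small, stable integer index.

    Hypothetical sketch: using the index as the label value keeps the number
    of series bounded, whereas raw addresses would create one series each.
    """

    def __init__(self, max_traders: int = 100) -> None:
        self.max_traders = max_traders
        self._index_by_address: Dict[str, int] = {}

    def label_for(self, address: str) -> str:
        key = address.lower()  # treat checksummed and lowercase forms as one trader
        if key not in self._index_by_address:
            if len(self._index_by_address) >= self.max_traders:
                return "overflow"  # assumed catch-all once the cap is reached
            self._index_by_address[key] = len(self._index_by_address)
        return str(self._index_by_address[key])

    @property
    def active_count(self) -> int:
        """Number of distinct traders seen so far (excluding overflow)."""
        return len(self._index_by_address)
```

The same address always maps to the same index within a run, so counters for one trader stay on one series.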

Baseline Comparison Metrics

  • cow_perf_baseline_comparison_percent - Gauge for percentage change from baseline
  • cow_perf_regression_detected - Gauge for regression counts by severity
  • cow_perf_regressions_total - Counter for total regressions detected
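
As a worked example of what a "percentage change from baseline" gauge would hold, the sketch below computes the change and buckets it by severity. The function names and the 10% / 25% thresholds are assumptions for illustration; the PR does not state the actual thresholds or severity names.

```python
from typing import Optional


def percent_change(baseline: float, current: float) -> float:
    """Relative change of current versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100.0


def regression_severity(
    change_percent: float,
    minor_threshold: float = 10.0,  # assumed threshold
    major_threshold: float = 25.0,  # assumed threshold
) -> Optional[str]:
    """Bucket a change for a metric where higher is worse (for example latency)."""
    if change_percent >= major_threshold:
        return "major"
    if change_percent >= minor_threshold:
        return "minor"
    return None
```

For a latency that moves from 200 ms to 250 ms, `percent_change(200.0, 250.0)` is 25.0, which this sketch would report as a major regression.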

Code Changes

  • MetricsRegistry: Added 4 new initialization methods for Phase 2 metric categories
  • PrometheusExporter:
    • Added callback handlers for API (_update_api_metrics) and resource (_update_resource_metrics) types
    • Added trader tracking with index-based cardinality management
    • Added manual recording methods for all new metric types
  • MetricsStore: Updated the resource callback to pass a (container_name, sample) tuple

How to Test

  1. Run the test suite:

    poetry run pytest tests/unit/prometheus/ -v
  2. Start exporter and verify metrics:

    from cow_performance.prometheus.exporter import PrometheusExporter
    
    exp = PrometheusExporter(port=9091, scenario='test')
    exp.start()
    exp.record_api_request('/api/v1/orders', 'POST', 200, 0.15)
    exp.update_container_resources('orderbook', 45.5, 536870912)
    # curl http://localhost:9091/metrics | grep cow_perf_
  3. Run with CLI:

    cow-perf run --prometheus-port 9091 --duration 60
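
If you would rather check the endpoint from Python than with curl and grep, a small stdlib-only sketch follows. The helper names are hypothetical, and the parser only handles plain sample lines of the Prometheus text format (it skips `# HELP` / `# TYPE` comments and ignores timestamps).

```python
from typing import Dict
from urllib.request import urlopen


def fetch_metrics(url: str = "http://localhost:9091/metrics") -> str:
    """Download the raw exposition text from a running exporter."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def cow_perf_samples(exposition_text: str, prefix: str = "cow_perf_") -> Dict[str, float]:
    """Return {series: value} for sample lines whose name starts with prefix."""
    samples: Dict[str, float] = {}
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        if series.startswith(prefix) and rest:
            samples[series] = float(rest[0])
    return samples
```

With the exporter from step 2 running, `cow_perf_samples(fetch_metrics())` should return a non-empty dict once at least one metric has been recorded.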

Checklist

  • Tests pass (poetry run pytest tests/unit/prometheus/ - 56 passed)
  • Linting passes (poetry run ruff check .)
  • Type checking passes (poetry run mypy src/cow_performance/prometheus/)
  • Implementation plan documented (thoughts/plans/2026-02-06-cow-591-phase-2-prometheus-exporter.md)

Breaking Changes

Minor: MetricsStore.add_resource_sample() now passes a (container_name, sample) tuple to callbacks instead of just the sample. This only affects code that registers callbacks for resource metrics.
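
For anyone with an existing resource callback, one possible migration path is a thin adapter like the sketch below. It assumes the tuple arrives as a single positional argument, which is how the description above reads; `adapt_legacy_callback` is a hypothetical helper, not part of this PR.

```python
from typing import Any, Callable, Tuple

# Assumed new callback shape: one argument holding (container_name, sample).
ResourceCallback = Callable[[Tuple[str, Any]], None]


def adapt_legacy_callback(legacy: Callable[[Any], None]) -> ResourceCallback:
    """Wrap a sample-only callback so it accepts the new tuple payload."""

    def wrapper(payload: Tuple[str, Any]) -> None:
        _container_name, sample = payload  # the legacy callback ignores the name
        legacy(sample)

    return wrapper
```

New callbacks can unpack the tuple directly and use the container name as a label value.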

Related Issues

  • COW-591: Prometheus Exporters (Phase 2 completes this ticket)
  • Enables COW-593: Grafana Dashboards (depends on these metrics)

🤖 Generated with Claude Code

Add API, resource, per-trader, and baseline comparison metrics to
complete COW-591 Prometheus exporter deliverable:

- API metrics: requests counter, response time histogram, errors counter
- Resource metrics: container CPU, memory, network gauges
- Per-trader metrics: orders submitted/filled by trader index
- Comparison metrics: baseline percent change, regression detection

Update MetricsStore to pass container name with resource callbacks.
Add 21 new unit tests covering all Phase 2 functionality.
@linear

linear bot commented Feb 6, 2026

@jeffersonBastos jeffersonBastos changed the base branch from develop to jefferson/cow-591-11-prometheus-exporters February 6, 2026 16:51
@jeffersonBastos jeffersonBastos marked this pull request as ready for review February 6, 2026 17:59
jeffersonBastos and others added 11 commits February 10, 2026 11:35
Add two Grafana dashboards for monitoring performance tests:
- Overview dashboard: test progress, order rates, latency distributions
- API Performance dashboard: response times, throughput, error rates

Configure dashboard provisioning via docker-compose volume mount and
add explicit UID to Prometheus datasource for dashboard compatibility.

Add upload_app_data_with_retry() and get_open_order_count() methods
that were missing from the instrumented wrapper, causing AttributeError
when used in place of the underlying OrderbookClient.

Add three new dashboards completing the Grafana visualization suite:

- Resources dashboard: CPU, memory, network monitoring per container
- Comparison dashboard: baseline vs current with regression indicators
- Trader Activity dashboard: per-trader statistics and activity patterns

Update existing dashboards with cross-navigation links to all 5 dashboards.

… COW-593

Document Prometheus exporter phases and Grafana dashboard implementation
plans to track progress on metrics infrastructure work.

- Add prometheus_port config field with default 9091
- CLI uses config default, --prometheus-port 0 to disable
- Enhance order timeout logging with status, age, token pair, lifecycle
- Improve monitoring output with status breakdown counts
- Show all terminal states in final summary (filled/expired/failed/cancelled)
- Update README and CLI docs with monitoring instructions
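
The port behaviour described in this commit (config default 9091, CLI override, 0 to disable) can be sketched as follows. `MonitoringConfig` and `resolve_prometheus_port` are hypothetical names; only the `prometheus_port` field, its default, and the meaning of 0 come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonitoringConfig:
    """Hypothetical config holder; only prometheus_port is taken from the commit."""

    prometheus_port: int = 9091


def resolve_prometheus_port(config: MonitoringConfig, cli_port: Optional[int]) -> Optional[int]:
    """Pick the port to serve on, or None when the exporter is disabled.

    cli_port is None when --prometheus-port was not given on the command line.
    """
    port = config.prometheus_port if cli_port is None else cli_port
    return None if port == 0 else port
```
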
Add concurrent Prometheus metrics update loop that exports test progress
and throughput metrics every second during performance test runs. This
fixes "No Data" panels in the Overview dashboard.
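
A concurrent update loop of this kind might look like the asyncio sketch below. It is an assumption-laden illustration: the function names, the `update` callable, and the use of asyncio rather than a thread are not confirmed by the commit.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def metrics_update_loop(
    update: Callable[[], None],
    stop: asyncio.Event,
    interval: float = 1.0,
) -> None:
    """Call update() every `interval` seconds until stop is set."""
    while not stop.is_set():
        update()
        try:
            # Sleep for the interval, but wake immediately if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass


async def run_test_with_metrics(
    run_test: Callable[[], Awaitable[Any]],
    update: Callable[[], None],
    interval: float = 1.0,
) -> Any:
    """Run the test coroutine while the update loop exports progress alongside it."""
    stop = asyncio.Event()
    updater = asyncio.create_task(metrics_update_loop(update, stop, interval))
    try:
        return await run_test()
    finally:
        stop.set()
        await updater  # let the loop exit cleanly before returning
```

Running the updater next to the test, rather than only at the end, is what gives dashboard panels data points while the run is still in progress.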

Remove redundant P50 delta panels from the comparison dashboard and
adjust grid positions for cleaner layout.

- Create 7 core alerting rules (latency, error rate, throughput, resources, test execution)
- Enable rule_files in Prometheus configuration
- Add alerts volume mount in Docker Compose
- Add Grafana annotations to show firing alerts on dashboard
- Add container_memory_percent metric for CriticalMemoryUsage alert
- Add implementation plan: thoughts/plans/2026-02-13-cow-598-alerting-rules.md
- Add implementation notes to ticket file documenting scope decisions
- Update INDEX.md with plan entry and document cluster reference

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…aining-dashboards-resources-comparison

COW-593 task 2 remaining dashboards resources comparison
…ential-dashboards-overview-api

feat(grafana): add performance and API monitoring dashboards
@lgahdl lgahdl merged commit e4b069e into jefferson/cow-591-11-prometheus-exporters Feb 26, 2026
10 checks passed

2 participants